SYSTEM AND METHOD FOR COMPILING A FINE-GRAINED ARRAY BASED SOURCE
PROGRAM ONTO A COURSE-GRAINED HARDWARE

CROSS-REFERENCE TO OTHER APPLICATIONS

The following applications are assigned to the assignee of the present application:

U.S. patent application Ser. No. 07/042,761, filed Apr. 27, 1987, by W. Daniel Hillis, entitled
"Method and Apparatus for Simulating M-Dimensional Connection Networks in an N-Dimensional
Network Where M is Less Than N", is incorporated herein by reference in its entirety.

U.S. patent application entitled "System and Method for Compiling Towards a Super-Pipelined
Architecture", Ser. No. 07/830,564, filed Feb. 3, 1992, is incorporated herein by reference in
its entirety.

U.S. patent application entitled "System and Method for Mapping Array Elements to Processing
Elements", Ser. No. 07/829,482, filed Feb. 3, 1992, is incorporated herein by reference in its
entirety.

U.S. patent application entitled "System and Method [...]", Ser. No. 07/827,945, filed Feb. 3,
1992, is incorporated herein by reference in its entirety.

INCORPORATION BY REFERENCE

U.S. Pat. No. 4,598,400, issued Jul. 1, 1986, to W. Daniel Hillis, for "Method and Apparatus for
Routing Message Packets", and assigned to the assignee of the present application, is
incorporated herein by reference in its entirety.

U.S. Pat. No. 4,773,038, issued Sep. 20, 1988, to Hillis et al., for "Method of Simulating
Additional Processors in A SIMD Parallel Processor Array", and assigned to the assignee of the
present invention, is incorporated by reference in its entirety.

U.S. Pat. No. 4,827,403, issued May 2, 1989, to Steele, Jr. et al., for "Virtual Processor
Techniques in a SIMD Multiprocessor Array", and assigned to the assignee of the present
invention, is incorporated herein by reference in its entirety.

U.S. Pat. No. 4,984,235, issued Jan. 8, 1991, to Hillis et al., for "Method and Apparatus for
Routing Message Packets and Recording the Routing Sequence", and assigned to the assignee of
the present application, is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention relates generally to a system and method of compiling a computer program, and
more particularly, to a system and method for compiling a computer program wherein the computer
program is adapted for use with a data parallel computer.

2. Discussion of Related Art

A compiler is a computer program which receives a source program as input. The source program
is written in a source language. The compiler translates the source program into an equivalent
target program. As a general reference that describes the principles used to design compilers
for serial computers, see Aho et al., Compilers: Principles, Techniques and Tools,
Addison-Wesley Publishing Co. (1988), which is hereby incorporated by reference in its entirety
herein. The target program is written in a target language. Many source and target languages
are known. For example, source languages include APL, Ada, Pascal, Fortran, C, and Lisp. Target
languages include machine languages for computers having one or a great number of processors.
Compilers which support parallel data processing allow the definition and use of parallel
variables. For reference purposes, such compilers are called data parallel compilers.

For example, the Connection Machine® (CM) computer CM-2 system, designed by Thinking Machines
Corp., Cambridge, Mass. 02142, is a massively parallel computer with up to 65,536 bit serial
processors and 2048 floating point accelerator chips. The CM-2 evolved out of the CM-1, which
did not have any floating point hardware. The primary interface used by CM-2 compilers has been
the Paris assembly language.

The Paris language is a low-level instruction set for programming the data parallel computer.
The Paris language is described in the Thinking Machines Corporation documents Paris Reference
Manual (Version 6.0, February 1991) and Revised Paris Release Notes (Version 6.0, February
1991). These documents are available from the Thinking Machines Corporation Customer Support
Department at 245 First Street, Cambridge, Mass. Even though Paris is implemented in a way that
allows the underlying floating point hardware to perform calculations, it still reflects the
fact that the CM-1 had no registers: all Paris operations (also called fieldwise operations) are
memory to memory. This places a memory bandwidth limit on the peak gigaflop rating of Paris, and
therefore on any compiler whose target is the Paris language, of approximately 1.3-2.5 gigaflops
on a full size CM-2 (64K bit serial processors, 2K FPUs). The higher speeds can be attained by
multiply-add instructions, which are only useful in special situations.

The CM-2 provides three dimensions of parallelism: superpipelines, superscalar, and multiple
processors. A more in-depth discussion of these concepts can be found in Almasi et al., Highly
Parallel Computing, Benjamin/Cummings Publishing Co. (1989), Hennessy et al., Computer
Architecture: A Quantitative Approach, Morgan Kaufmann Publishers (1990), and Johnson,
Superscalar Microprocessor Design, Prentice-Hall (1991), which are hereby incorporated by
reference in their entirety herein.

Generally, pipelining is an implementation technique whereby multiple instructions are
overlapped in execution. Today, pipelining is one of the key implementation techniques used to
build fast processors. A pipeline is like an assembly line: each step in the pipeline completes
a part of the instruction. Each of the steps is called a pipe stage or pipe segment. The stages
are connected one to the next to form a pipe: instructions enter at one end, are processed
through the stages, and exit at the other end. Pipelining is an implementation technique that
exploits parallelism among the instructions in a sequential instruction stream. It has a
substantial advantage over scalar sequential processing.

The throughput of the pipeline is determined by how often an instruction exits the pipeline.
Because the pipe stages are hooked together, all the stages must be ready to proceed at the same
time. The time required between moving an instruction one step down the pipeline is a machine
cycle. Pipelining yields a reduction in the average execution time per instruction.

The term superscalar describes a computer implementation that improves performance by concurrent
execution of scalar instructions. Superscalar processors typically allow the widening of the
processor's pipeline. Widening the pipeline makes it possible to execute more than one
instruction per cycle. Thus, superscalar refers to issuing more than one instruction per clock
cycle. This allows the instruction-execution rate to exceed the clock rate.

In regard to the issue of multiple processors, designers of parallel computers have tried a
variety of methods in order to fully utilize the underlying hardware. For example, earlier
parallel computer systems assumed a separate processor for every data element, so that one may
effectively operate on all data elements in parallel. When one such instruction is used, it is
performed (possibly conditionally) by every hardware processor, each on its own data. Many of
the usual arithmetic and logic instructions found in contemporary computer instruction sets
(such as subtract, multiply, divide, max, min, compare, logical and, logical or, logical
exclusive or, and floating point instructions) are provided in this form.

A typical difficulty with these computer systems arises when the number of data elements in the
problem to be solved exceeds the number of hardware processors. For example, if a machine
provides 16,384 processors configured in a 128 x 128 two dimensional grid, and a problem
requires the processing of 200 x 200 elements (total 40,000), the programming task is much more
difficult because one can no longer assign one data element to each processor, but must assign
two or more data elements to some processors. Even if a problem requires no more than 16,384
data elements, if they are to be organized as a 64 x 256 grid rather than a 128 x 128 pattern,
programming is again complicated, this time because the problem communication structure does not
match the hardware communication structure.

One solution to this problem was described in U.S. Pat. No. 4,827,403 to Steele, Jr. et al. The
'403 patent describes a virtual processor mechanism which causes every physical hardware
processor to be used to simulate multiple virtual processors. Each physical processor simulates
the same number of virtual processors. However, the virtual processor model creates an
artificial memory hierarchy. For example, FIG. 1 shows sixteen virtual processors on one of the
bit serial processors. The memory (m) would get sub-divided into sixteen blocks (m/16). Consider
the elements of an array A(0)-A(N), where the array elements A(0)-A(16) are placed in the
sixteen virtual processors. This creates a problem. The gap between the elements, as shown at
reference number 550, is very large. This creates a very serious memory performance degradation.
Instead of having one cycle per access you get two cycles per access or more. This division of
memory is not what the user wanted. The user wanted to put sixteen elements next to each other
and operate on them.

The goal of the compiler designer is to try to exploit all three levels of parallelism (i.e.,
superpipelined, superscalar, and multiple processors). This has presented a substantial problem.
As stated by Hennessy et al. (pg. 581), compilers of the future have two obstacles to overcome:
(1) how to lay out the data to reduce memory hierarchy and communication overhead, and (2)
exploitation of parallelism. Parallelizing compilers have been under development since 1975 but
progress has been slow.

Thus, it would be advantageous to provide a system and method for exploiting the inherent
parallelism of parallel target machines and reducing the memory hierarchy, thus allowing the
machine to go beyond the memory bottleneck.

SUMMARY OF THE INVENTION

In view of the foregoing, the present invention provides a parallel vector machine model for
building a compiler that exploits three different levels of parallelism found in a variety of
parallel processing machines (e.g., the Connection Machine® Computer CM-2 system). The
fundamental idea behind the parallel vector machine model is to have a target machine that has a
collection of thousands of vector processors, each with its own interface to memory, thus
allowing a fine-grained array-based source program to be mapped onto a coarse-grained hardware.

In the parallel vector machine model used by CM Fortran 1.0, the FPUs, their registers, and the
memory hierarchy are directly exposed to the compiler. Thus, the CM-2 target machine is not 64K
simple bit-serial processors. Rather, the target is a machine containing 2K PEs (processing
elements), where each PE is both superpipelined and superscalar. The compiler uses a data
distribution to spread the problem out among the 2K processors. A new compiler phase is used to
separate the code that runs on the two types of processors in the CM-2: the parallel PEs, which
execute a new RISC-like instruction set called PEAC, and the front end processor, which executes
SPARC or VAX assembler code. The pipelines in the PEs are filled by using vector processing
techniques along with the PEAC instruction set. A scheduler overlaps the execution of a number
of RISC operations.

In particular, the methodology involved in utilizing the parallel vector machine model in the
CM-2 comprised handcrafting the best possible microcode for the benchmark kernels, defining a
RISC-like assembly language that could be assembled into microcode with similar performance
properties, and then designing a compiler to generate that new assembly language. The new
assembly language produced a new model of the CM-2, one in which the key element was the FPU
rather than the bit-serial processor.

To implement the parallel vector machine model, the 64-bit floating-point accelerators of the
CM-2 are used as the basic physical processing elements. By treating the CM-2 as a set of vector
processors, instead of a set of bit-serial processors, and thus using the floating-point
registers to avoid many reads and writes to memory, the parallel vector machine model excels at
performing elemental operations on floating-point and integer data. The driving factor behind
the development of the slicewise model is the performance potential of using the registers and
vector-processing capabilities of the units (chips) of the 64-bit floating point accelerator.

Special microcode can make explicit use of the FPU registers as the source of operands and the
destination of results for elemental computations, thus avoiding memory loads and stores. Also,
the 64-bit FPU can be used as a vector processor (actually, a set of vector processors, each
with a vector length of 4). The theoretical peak performance of such code (that is, with loads
and stores and without communication) is 14 Gflops for a full-sized CM-2. The goal of the
slicewise execution model is to allow CM Fortran programs to exploit this performance
enhancement.

The compiler itself does not perform CM-2 memory management or interprocessor (meaning inter-PE)
communication. Instead, it calls the functions of a run-time
library. The run-time system lays out arrays in CM-2 memory differently depending on the number
of PEs available to execute the program. Since array layout is not determined at compile time,
you can run a CM Fortran program on any size CM system without recoding or recompiling. As with
the Paris model, the (physical) PE loops over the array elements assigned to it, repeating each
instruction as many times as necessary.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other features and advantages of the invention will be apparent from the
following more particular description of the preferred embodiments of the invention, as
illustrated in the accompanying drawings, in which:

FIG. 1 shows an example of the artificial memory hierarchy present with a virtual processor
model;
FIG. 2 shows a system configuration for the massively parallel CM-2;
FIG. 3 shows an architectural diagram of a sequencer and a single processing element;
FIG. 4 shows an example of mapping two one dimensional arrays onto a sixteen PE machine;
FIG. 5 shows an example of mapping an array X of size 30 onto a 4-PE machine;
FIG. 6 shows a slicewise versus a fieldwise layout;
FIG. 7 shows an example of mapping a simple elemental program onto the CM-2 by the slicewise
compiler;
FIG. 8 shows an example of mapping a simple elemental program onto the CM-2 by Paris
(fieldwise);
FIG. 9 shows an example of mapping a simple elemental program with a three-operand RHS onto the
CM-2 by the slicewise compiler;
FIG. 10 shows an example of mapping a simple elemental program with a three-operand RHS onto the
CM-2 by Paris (fieldwise);
FIG. 11 shows the structure of slicewise compiler 1100 through code generator 1150;
FIG. 12 shows an illustration of the slicewise code generator 1150; and
FIG. 13 shows an example of the three levels of parallelism that exist in the CM-2.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

The present invention is directed to a software compiler for compiling a computer program
wherein the computer program is adapted for use with a data parallel computer.

The data parallel computer may be one manufactured by Thinking Machines Corporation, such as the
Connection Machine® Model CM1®, CM2® and CM5® Supercomputers. These are described in U.S. Pat.
No. 4,598,400 to Hillis, U.S. Pat. No. 4,984,235 to Hillis et al., and U.S. patent application
Ser. No. 07/042,761, entitled "Method and Apparatus for Simulating M-Dimensional Connection
Networks in an N-Dimensional Network Where M is Less Than N", filed Apr. 27, 1987, by Hillis,
all of which were cited above.

Specifically, U.S. Pat. No. 4,598,400 describes a massively-parallel computer, including one
embodiment of processors and router, with which the present invention can be used. U.S. Pat. No.
4,984,235 describes a massively-parallel computer. U.S. patent application Ser. No. 07/042,761,
entitled "Method and Apparatus for Simulating M-Dimensional Connection Networks in an
N-Dimensional Network Where M is Less Than N", describes, in a massively parallel computer
including processor chips interconnected by a hypercube, an arrangement for emulating 2-, 3-, or
higher dimensional nearest-neighbor communication network ("NEWS") between chips using the
hypercube wires.

The present invention provides a parallel vector machine model (also referred to as the
slicewise model) for exploiting the three levels of parallelism found in some parallel machines.
The fundamental idea behind the parallel vector machine model of the present invention is to
have a target machine that has a collection of thousands of vector processors, each with its own
interface to memory, thus allowing the machine to go beyond the memory bottleneck associated
with the previous Paris model. The methodology involved in utilizing the parallel vector machine
model in the CM-2 comprised handcrafting the best possible microcode for the benchmark kernels,
defining a RISC-like assembly language that could be assembled into microcode with similar
performance properties, and then designing a compiler to generate that new assembly language.
The new assembly language gave us a new model of the CM-2, one in which the key element was the
FPU rather than the bit serial processor. The overall result of the compiler changes has been a
2-3 times increase in computation performance on the benchmark codes.

I. Overview of CM Fortran Array Features

CM Fortran basically consists of Fortran 77 with the array features of the ISO Fortran 90
standard added in. Although Fortran 90 contains a number of other interesting and useful
features, it is the array features which allow parallelism to be exploited. The description of
Fortran 90 array features below is an abbreviation of the one in Albert et al., Compiling
Fortran 8x Array Features for the Connection Machine Computer System, Symposium on Parallel
Programming: Experience with Applications, Languages and Systems, ACM SIGPLAN (July 1988), which
is hereby incorporated by reference in its entirety. Some of the more important Fortran 90 array
features include:

array assignment:
If A and B are conformable (same shape) arrays of any rank, the statement A = B assigns each
element of B to the corresponding element of A.
A(2:N+1) = B(1:N) assigns the first N elements of the one-dimensional array B to A in positions
from 2 to N+1.
A(2:N+1) = B(N:1:-1) assigns the same elements in reverse order.

arithmetic operations on arrays and array sections:
B+C indicates an elementwise addition of the arrays B and C.
A(2*N:2:-2) * B(3:N+2) calls for multiplication of the described subsections of A and B. In this
case, it would produce a one-dimensional array of length N containing in sequence the values
A(2*N)*B(3), A(2*N-2)*B(4), ..., A(2)*B(N+2).

relational operations such as A .EQ. B are allowed, and return arrays of booleans.

masked array assignments. For example, WHERE (B .NE. 0) B = A/B assigns the quotient of A and B
to the non-zero elements of B.

array intrinsic functions:
elemental array functions, such as SIN, which extend    kilobytes of memory. Every 32 bit-serial processors
scalar operations to elementwise operations on    share a single FPU and a transposer to convert be-
arrays.    tween the bit-serial fieldwise representation of a number
transformational functions such as CSHIFT (circular    and a word-based slicewise representation. The alter-
shift) and EOSHIFT (end-off shift).    native view, which is used by CM Fortran 1.0 (Think-
array reduction functions such as ALL, ANY,    ing Machines Corporation, CM Fortran Reference Man-
COUNT, MAXVAL, MINVAL, PRODUCT,    ual, Version 1.0, Cambridge, Mass. 02142 (1991)), is that
and SUM. Each of these functions has two forms:    each PE is a unit containing a DP and associated mem-
one reduces an entire array, the other a single dimension,    ory; there are 2K PEs in the machine, and all data is
and returns an array of rank one less than its argu-    stored in the word-based slicewise representation. Each
ment.    PE is both superpipelined and superscalar.
array construction functions such as RESHAPE and    Note that the slicewise compiler does not make any
SPREAD.    use of the bit-serial processors. This is acceptable because
vector-valued subscripts:    efficient usage is still being made of the underlying
A vector integer expression is used to specify an array    silicon. Not counting memory, each DP along with its
section. For example, if V is a one-dimensional    associated bit serial and routing hardware contains ap-
array with elements 3, 4, 22, and 6, then    proximately 145,000 gates. Of these, 40% are devoted to
A(V) evaluates to [A(3), A(4), A(22), A(6)].    the FPUs and only 5% to the bit serial processors. Al-
When used on the right-hand side of an assignment    though our new compilation techniques sacrifice 5% of
statement, a vector-valued subscript performs a    the gates, the payback is that 40% are used with much
gather type of operation. On the left-hand side, it    higher efficiency than before.
acts like a scatter.
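
The array-section, masked-assignment, and vector-valued-subscript semantics listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`section`, `gather`, `scatter`, `where_divide`) are hypothetical and are not part of CM Fortran or any Thinking Machines library, and Fortran's 1-based indexing is emulated on ordinary Python lists.

```python
def section(lo, hi, stride=1):
    """Return the 1-based indices named by the Fortran triplet lo:hi:stride."""
    if stride == 0:
        raise ValueError("stride must be nonzero")
    stop = hi + 1 if stride > 0 else hi - 1
    return list(range(lo, stop, stride))


def gather(a, v):
    """Right-hand-side use of a vector-valued subscript: A(V)."""
    return [a[i - 1] for i in v]


def scatter(a, v, values):
    """Left-hand-side use of a vector-valued subscript: A(V) = values."""
    out = list(a)
    for i, x in zip(v, values):
        out[i - 1] = x
    return out


def where_divide(a, b):
    """WHERE (B .NE. 0) B = A/B, leaving the zero elements of B untouched."""
    return [x / y if y != 0 else y for x, y in zip(a, b)]


if __name__ == "__main__":
    a = [10 * i for i in range(1, 23)]   # A(1)..A(22) hold 10, 20, ..., 220
    print(gather(a, [3, 4, 22, 6]))      # the elements A(3), A(4), A(22), A(6)
    print(section(6, 2, -2))             # indices selected by A(2*N:2:-2) for N = 3
```

Note that a triplet whose stride points away from its upper bound, such as 1:3:-1, selects no elements, which is why a reversed section is written from the high index down to the low one.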

There are many other Fortran 90 array features (and    B. Target Machine Description
they are supported by CM Fortran), but these few    FIG. 3 illustrates the architecture of the PE 110 and
are sufficient to give the flavor of the language.    sequencer 120 as viewed by the compiler. No communi-
II. The Slicewise CM-2 Target Machine    cation hardware ([...] broadcast, etc.)
    shows up in the diagram because communication hard-
This section will attempt to give an intuitive under-    ware is not currently exposed to the compiler; the run
standing of what the target machine looks like to the    time library handles all communication. Referring to
compiler and how the generated code will work. The    FIG. 2 and FIG. 3, from the point of view of the slice-
key features of the target machine are that it contains a    wise compiler, the CM-2 target machine has the follow-
large number of vector processors which can commu-    ing specifications: 2048 64-bit PEs 210, 4 megabytes of
nicate through a variety of communication networks.    memory 215 per processor (1 single precision mega-
The central issues in the slicewise model are array lay-    word), 7 vector registers 320 per PE, each of length 4,
out (mapping of arrays to CM processors), FPU register    4 scalar registers 330 per PE, and a sequencer processor
utilization, and memory utilization.    220 that controls the PEs.
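
As a rough consistency check on the specifications quoted above, the following Python sketch derives a few totals from them. The class and method names are illustrative assumptions rather than part of any Thinking Machines software, and a megabyte is taken to be 2^20 bytes and a single precision word to be 4 bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlicewiseTarget:
    """Figures quoted in the text for a full size CM-2 (names are hypothetical)."""
    pes: int = 2048                     # 64-bit processing elements
    bytes_per_pe: int = 4 * 2**20       # 4 megabytes of memory per PE
    vector_registers: int = 7           # vector registers per PE
    vector_length: int = 4              # elements per vector register
    scalar_registers: int = 4           # scalar registers per PE
    bit_serial_processors: int = 64 * 2**10

    def total_memory_bytes(self) -> int:
        return self.pes * self.bytes_per_pe

    def single_precision_words_per_pe(self) -> int:
        return self.bytes_per_pe // 4   # assumes 4-byte single precision words

    def register_words_per_pe(self) -> int:
        return self.vector_registers * self.vector_length + self.scalar_registers

    def bit_serial_per_pe(self) -> int:
        return self.bit_serial_processors // self.pes


if __name__ == "__main__":
    cm2 = SlicewiseTarget()
    print(cm2.total_memory_bytes() // 2**30, "gigabytes in total")
    print(cm2.single_precision_words_per_pe() // 2**20, "megaword per PE")
    print(cm2.register_words_per_pe(), "register words per PE")
    print(cm2.bit_serial_per_pe(), "bit-serial processors per PE")
```

Under these assumptions the per-PE figures multiply out to the system-wide ones given in the text: 8 gigabytes of memory in total and one single precision megaword per PE.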

A. Hardware Configuration    [...]
    several benchmarks. Using a longer vector length
The hardware elements of a Connection Machine    would allow the overhead of instruction delivery to be
CM-2 system include front-end computers such as the    amortized better, but since the pool of DP registers
Sun-4® and VAX® that provide the development    available is fixed, there would be fewer vector registers
and execution environments for the users' software, a    with a longer vector length. Reducing the number of
parallel processing unit which executes the data parallel    registers available increased register spills and de-
operations, and a high-performance data parallel I/O    creased performance on the benchmark codes.
system. The parallel processing unit of a fully populated    A vector architecture is used rather than using soft-
CM-2 consists of 64K bit-serial processors, 2048 Weitek    ware pipelining techniques in a scheduler because of the
WTL6164 64-bit floating point units, and an interpro-    cost of delivering instructions to the PEs 210. Instruc-
cessor communications system (Douglas et al., The    tion delivery to PEs 210 takes two cycles: one for the
Architecture of the CM-2 Data Processor, Technical Re-    opcode, and one for the registers. By using vector in-
port HA68-1, Thinking Machines Corporation, Cam-    structions, we can in effect deliver an instruction per
bridge, Mass. 02142 (1988)). Although it is called a    cycle, because the opcode does not have to be changed.
"floating point" accelerator, the WTL6164 FPU is not    The time needed for instruction delivery, combined
limited to performing floating point calculations. It is a    with the vector length of 4, results in a best obtainable
complete ALU, and can perform both integer and logi-    slicewise speed of 14 gigaflops, which can be achieved
cal operations. This ability is crucial to the slicewise    on polynomial evaluation. The hardware's peak of 28
compiler (i.e., the compiler for the slicewise model);    gigaflops can only be obtained with a longer vector
therefore the FPUs will be referred to as Data Proces-    length.
sors (DP) throughout this document. Each DP has 4    Sequencer 220 is a controlling processor that drives
megabytes of memory (8 gigabytes in the entire system).    the PEs 210. It has a number of address registers 350
A sequencer mediates between the front-end and the    that point into the memory 215 of the PEs 210. For
parallel processing unit. FIG. 2 shows a simplified sys-    example, such a register might point to the beginning of
tem configuration diagram. It leaves out the details of    the data for a parallel variable B. One of these pointer
how multiple front ends can simultaneously control    registers is reserved by the compiler for use as a parallel
different sections of the CM-2.    stack pointer. Sequencer 220 has stride registers 360
The term PE (processing element) in the CM-2 can be    which can be used to walk over non-contiguous values
interpreted in two different ways. In the Paris or field-    (such as an array section that contains every other ele-
wise viewpoint, each PE is a bit-serial processor, and    ment of an array, like A(1:N:2)). Sequencer 220 also has
there are 64K of them. Each bit-serial processor has 128    a number of scalar registers, including a conventional
stack pointer into its own memory, a loop iteration    sors, each with a small amount of memory, the slicewise
counter 370 for controlling looping over arrays of data,    CM-2 has many processors, each with a large amount of
and memory 380 to contain the compiled PE program.    memory. This change in viewpoint, even without the
Sequencer 220 uses FIFOs 390, 395 to communicate    other improvement of the slicewise compiler (reduced
with the front end that controls the CM-2.    waste of memory bandwidth), can change the perfor-
From the compiler's point of view, the PE/sequencer    mance properties of algorithms.
300 is replicated 2048 times in a full size CM-2, although    For example, consider the fact that each basic pro-
in fact only the PE (the upper half of the figure) is    cessing unit now has 4 megabytes of memory rather
physically replicated, and the point-to-point lines deliv-    than only 128K. Suppose an application has to look up
ering instructions and data to it from the controlling    data in a large table for each element in an array, and
sequencer 220 are actually hardware fan-out trees that    suppose that the table has 100,000 entries. The Paris
broadcast the required information from sequencer 220    model would spread this data out among the processors
to all 2048 PEs.    and require an expensive (in time) communication call
Sequencer 220 executes a language called PEAC    to access it. With the slicewise view of the machine, one
(Processing Element Assembly Code). PEAC is a load-    can store a copy of the entire table in every PE, allow-
store RISC-like instruction set that includes vector    ing fast local lookup. This is important, because local
instructions, and was created for use with the slicewise    lookup can be 100 times faster than general interproces-
compiler. It has one addressing mode and about 30    sor communication on the CM.
instruction opcodes. It must be assembled into CM
microcode for execution, and as a result a single PEAC    C. Parallel Array Layout
instruction can take more than 1 cycle (hence the term    Array layout is handled by functions in the CMRT
RISC-like). Of course, there is no reason why a machine    run time library. Because layout is not handled at com-
could not be designed from scratch to execute PEAC    pile time, a user can run a compiled program on any size
directly. For a more in-depth discussion of a low level    of Connection Machine; the run time library takes care
programming language that uses vector instructions see    of the fact that the array layout will be different when
co-pending patent application entitled "System and    different numbers of PEs are available. For a more in-
Method for Compiling Towards a Super-Pipelined Ar-    depth discussion of the CMRT run-time library see
chitecture", which was cited above. There is nothing    copending patent application entitled "System and
inherent about the parallel vector machine model that    Method for Mapping Array Elements to Processing
requires that the PEAC instruction set be used. Any    Elements."
low-level instruction set that contains vector instruc-    Arrays are canonically laid out by mapping equal sized
tions can be used with the present invention.    subgrids of arrays onto the parallel processing elements
Below is a section of code which includes two PEAC    (PEs) of the CM. For example, FIG. 4 illustrates the
vector instructions:    allocation of two one-dimensional arrays onto a ma-
FLODV [P12+offset]stride++, V5    chine with 16 PEs 410. In this simple case, the machine
FMULV V5 V3 V7    itself is treated as a one-dimensional array of length 16
The first instruction takes the pointer register P12,    containing PEs, and equal sized one-dimensional sub-
adds offset to it, and causes each PE to load in parallel    grids are assigned to the processors. This canonical
the next 4 values from the pointed-to location in its local    layout has a number of implementation and perfor-
memory (separated by stride). Each PE places the four    mance consequences. Execution of an entirely elemen-
values that it loads into its vector register V5. The    tal statement, such as one that adds 1 to each element of
pointer register is then auto-incremented by the ++, in    A, becomes a simple matter of giving each PE a vector-
preparation for execution over the next four values.    ized loop to perform the required local computation on
This is a common idiom because PEAC code is usu-    its subgrids. However, a statement like A = A + B(1:128)
ally executed from inside of a strip-mined loop that    requires communication to align its operands. The rea-
walks over a vector with a length much larger than 4    son is that the 128 elements of A are spread among all 16
per PE. The second instruction causes each PE to take    processors, while the first 128 elements of B, which is
the four values in its vector register V5, multiply them    256 elements long, are concentrated in the first 8 proces-
by those in V3, and produce 4 values that are then    sors. Directives are available to describe non-canonical
stored into V7.    layouts, which are useful in performance tuning.
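
The canonical layout and the strip-mined vector loop described above can be illustrated with a small Python sketch. It assumes a simple contiguous block layout with PEs numbered from 0; the actual CMRT layout rules may differ, and every function name here is hypothetical. The loop only mimics the shape of a FLODV/FMULV pair and does not model PEAC itself.

```python
VECTOR_LENGTH = 4  # vector register length quoted in the text


def subgrid_size(n_elements, n_pes):
    """Elements per PE when equal sized subgrids cover the whole array."""
    return -(-n_elements // n_pes)  # ceiling division


def owner(index, n_elements, n_pes):
    """0-based PE holding the 1-based element `index` (assumed block layout)."""
    return (index - 1) // subgrid_size(n_elements, n_pes)


def misaligned(count, a_len, b_len, n_pes):
    """Indices i in 1..count for which A(i) and B(i) sit on different PEs."""
    return [i for i in range(1, count + 1)
            if owner(i, a_len, n_pes) != owner(i, b_len, n_pes)]


def strip_mined_multiply(xs, ys):
    """One PE's local loop: walk its subgrid in strips of VECTOR_LENGTH."""
    out = []
    for start in range(0, len(xs), VECTOR_LENGTH):
        v5 = xs[start:start + VECTOR_LENGTH]        # like a vector load into V5
        v3 = ys[start:start + VECTOR_LENGTH]        # second operand strip
        out.extend(a * b for a, b in zip(v5, v3))   # like FMULV V5 V3 V7
    return out


if __name__ == "__main__":
    # PEs that hold B(1:128) when B has 256 elements on a 16-PE machine.
    print(sorted({owner(i, 256, 16) for i in range(1, 129)}))
    # Number of element pairs of A = A + B(1:128) that are not co-located.
    print(len(misaligned(128, 128, 256, 16)))
```

With 16 PEs, a 128-element A has subgrids of 8 elements and a 256-element B has subgrids of 16, so under this assumed layout only the first eight element pairs share a PE and the remainder would need communication before the elemental add can run locally.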

The CM-2 hardware allows a DP instruction to be    A special case of array layout occurs when the num-
chained to an instruction that loads or stores data to or    ber of elements in an array is not a multiple of the num-
from the DP registers. PEAC and the PEAC Assembler    ber of PEs. It is a special case because on a SIMD (sin-
accommodate this by allowing two instructions, one a    gle instruction multiple data) machine, instructing the
calculation and the other a load or store, to be specified    "extra" processors to turn themselves off when opera-
on a single line of source code, indicating that they are    tions are performed on the small array is itself an opera-
to be overlapped. The semantics of the overlap are    tion that takes time to perform. By handling small arrays
those of chaining: it is as if the load was executed fol-    specially, the compiler avoids these operations alto-
lowed by the calculation. In the case of an overlapping    gether. The small user array is simply mapped onto a
store, it is as if the calculation was executed followed by    larger machine array; the machine array contains extra
the store. Thus, for the PEAC example above, the two    padding elements, as shown in FIG. 5. The result is that
instructions can be overlapped even though the multi-    all processors have equal sized subgrids to work on, but
plication uses the results of the load.    some of the subgrids contain padding rather than actual
Note that the slicewise CM-2 is the same piece of    data. Elemental calculations can take place in all array
hardware as any other 64K CM-2; it is only the view of    elements, including the padding elements. Performing
the hardware that has changed. Instead of 32 registers    calculations on the padding is an optimization because
per DP, there are 7 vector registers 320 plus 4 scalar    those extra processors have nothing better to do with
registers per DP. Instead of a huge number of proces-    their time. Since by their very nature, elemental calculations
are localized to elements, calculations on padding    have given each PE 63 elements of each of the 4 arrays.
data have no effect on the calculations on the user's real    In effect, PE-1 710a can simply loop from 1 to 64, and
data, but any floating point errors that arise from calcu- on interaction i it works on elements A(i), B(i), C(i), and 
lations on the padding data must be ignored or masked D(i). The slicewise compiler actually generates vector- 
out. Communications functions, however, must care- 5 ized loops with a loop increment of 4 rather than a 
fully skip over padding data; it cannot be allowed to scalar loop with an increment of 1. Thus, only 16 itera- 
generate spurious interprocessor messages. tions through the loop are needed to handle 64 ele- 

D Slicewise D»t* Lavout mcnts ' and thc various DP instructions operate on vec- 

U. blicewise Data Layout t0f$ ^^^tK pipelined to cover up overhead. 

FIG. 6 shows two different types of data layout: 10 The Paris implementation makes similar use of vectors, 

slicewise and fieldwise. In the slicewise layout, data is but its vectors are of length 32 because that is the length 

stored in a form that makes it easy for DP 610 to load or that the transposer handles. However, for simplicity, 

store a floating point number in a single cycle. This the loops shown in FIG. 7 through FIG. 10 are shown 

format puts each bit of a 32 bit floating point number in as ordinary scalar loops with a loop increment of 1 

a different bit-serial processor. The fieldwise approach, IS rather than as vector loops. 

however, stores data in a seemingly inconvenient form; The first instruction in the loop body loads B(i) into a 
all of the bits of a floating point number are stored with DP register, R0. The second instruction overlaps the 
a single bit-serial processor 605. The fieldwise form load of C(i) and its addition with R0 into Rl. Next, the 
allows efficient bit-serial computation, but because bit- result of the addition is stored out to B(i), and execution 
serial processors can only output I bit at a time, they 20 of the first the two Fortran statements for iteration i has 
cannot write fieldwise data directly to DP 615. In order been completed. The code to execute the second state- 
to use DP 615 on fieldwise data, fieldwise languages like ment comes next. 

Paris must convert data into slicewise form, perform a Since R0 already contains the value of B(i), there is 

floating point computation, and then convert the data no need to reload it. Instead, D(i) is loaded and the load 

back into fieldwise form. 25 is overlapped with a multiply by R0. Finally, the result 

The conversion operation is transposition, which is is written out to B(i). This code has taken basically 5 

performed on a 32x32 square block of boolean bits by instruction times (1 load, 2 stores, and 2 FLOPS) to 

transposer 630. Each of the 32 bit serial processors 605 accomplish two FLOPS. 

can write into a separate column of the transposer 630 Compare this to the computations required when DP 

hardware. They do this in parallel, but since they are bit 30 registers are not exposed to the compiler, as is the case 

serial processors it takes 32 times steps. Once transposer with the Paris instruction set used by the fieldwise com- 

630 is full, DP 615 is free to read out various rows, each piler. As the code in FIG. 8 illustrates, Paris must con- 

of which contains a complete floating point number. tain a loop to walk over the local data. Since this loop 

The inverse operation converts from slicewise format to is in a runtime Paris function and is not visible to the 

fieldwise format. 35 compiler, it cannot be merged with other nearby loops. 

Note that this time consuming transposition operation Each Paris instruction is memory to memory. All of the 

is only necessary as a prelude to using DP 615 on data arguments on the right hand side of each instructions 

when the native data format is stored fieldwise. By must be loaded into the transposer and then be streamed 

ignoring the bit-serial processors 605 and always using a past the DP. The result array produced by the instruc- 

slicewise memory format, DMF 1.0 avoids the conver- 40 tion must be completely written to memory before the 

sion step. It gains a number of other advantages as well, next Paris instruction can be executed. No DP register 

which are outlined below. values can be passed between Paris instructions. 

Paris operations requires user arrays to be at least In the abstract case shown in the figure, four loads, 

64K in size to even activate all DPs at a low level of two stores, and two FLOPS are required. Thus, three 

efficiency. (About 8 times as many numbers are needed 45 extra loads are required compared to the slicewise ver- 

to cover up various overheads such as transpose time. ) sion of this program. This is the roof motivation for the 

This 64K size corresponds to 1 number per bit-serial slicewise compiler; fieldwise programs are always 

processor. On the other hand, slicewise can activate all memory bandwidth bound. 

DPs simply by having a number per DP: an array of size An interesting side effect of this is that Paris compu- 

2048. 50 tation on double precision values takes about 2 times 

E. DP Register Utilization '? nger °" dou " c P recision numbers as on single preci- 
6 sion numbers, because every computation requires three 
Below is a simple section of elemental code which memory operations (two loads and a store), and mem- 
must be mapped onto the DPs and processors of the ory operations on double precision quantities take two 
CM: 55 memory access cycles. However, slicewise DP corn- 
real array (131072) :: a,b,c,d pute time on double precision values is the same as for 
a=a + b single precision values, because the DPs are 64-bit pro- 
fa =b * d cessors. Typical elemental code blocks include far 
The program involves four large one-dimensional ar- fewer than three memory operations per DP operation, 
rays, each containing 131,072 elements. The first line of 60 So even after memory cycles are counted, slicewise 
the program adds array B to array C elementwise, stor- double precision calculations end up taking only around 
ing the result into array A. The second line is similar, 1.3-1.6 times as long as single precision calculations, 
only it multiplies B and D and stores the result in B. The Thus, the parallel vector machine model has removed 
compiler performs serialization and loop fusion in the memory hierarchy by going to a courser grain 
order to make efficient use of the DP registers. 65 mode). This improves performance. The vector proces- 
FJG. 7 illustrates the work performed by the output tor model eliminates the gaps between the data, as op- 
of the slicewise compiler on the PEs of the CM-2. There posed to the large gaps that appeared in the virtual 
are 2048 PEs available, and the array layout functions processor model. Second, the register files can now be 
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Wilued. Previously, the register files were only used as to executing an entire loop. Thus, the instruction band- 
a buffer to store the data for one iteration since very width requirement is reduced. Vector instructions that 
statement was a in essence a DO loop. Now we have access memory have a known access pattern. If the 
more statements inside the DO loop. Thus, instead of vector's elements are all adjacent, then fetching the 
running at the memory bandwidth, the machine is run- 5 vector from a sei of heavily interleaved memory banks 
nmg at the register file bandwidth, and by utilizing the works very well . -p^ high | atcncy of mitiating a main 
register files on the CM-2, for example, the processor mem0 access versus j^^g a is amortiled 

™ by . tP** f, XW °- u- ' -ccess is initiated for the entire vector 

J£ ^„t™l P lT^ ™ VU ^t VeCt ° r <« h « » ■ single word. Thus, the cost of lately to 

model by generating a DO loop and strip mining it to m < • i , . 7 

map it onfo the vector machine. Strip mining is the 10 m ™ m ™ or * ,$ on ? on ? ' or K thc CTUrc vcct <> r ' 
generation of code such that each vector operation is than once for of lhe vector, 

done for a size less than or equal to the maximum vector . Moreov *r, vector machines can pipeline the opera- 
length defined by the machine. For a detailed explana- 0ons ° n the w* v «*ual elements. The pipeline includes 
tion of strip mining see Loveman, D. B., "Program .* ^ onl y ™ arithmetic operations (multiplicaUon, addi- 
Improvement by So urce-to- Source Transformation," J. ^ and *° on >' but al& ° memory accesses and effective 
of the ACM, Vol. 20, No. l t pp. 121-145. (January address calculations. In addition, most high-end vector 
1977). machines allow multiple vector operations to be done at 

_ _ the same time, creating parallelism among the opera- 

F. DP Memory Utilization tions on different elements. 

Below is a one-line section of elemental code which However, not all machines are designed to center 
must be mapped onto the DPs and processors of the wound the vector processor (e.g., the CM-2). The pres- 
CM: ent invention provides a compiler that compiles a 

source program written in an array-based language 
•-b+c+d 23 ( c ,g, ( CM Fortran 1.0) onto a parallel vector machine 

, model. Array-based languages are fine-grained (i.e., 

Tms program involves the same arrays as the previous they are designed to allow operations and/or manipula- 
example. but now three arrays are being added and the tion on individual bits of arrays), when the grain size or 
a & .i 10 ' Cd m [ hC f T* h * , ■ , granularity is the average task size, measured by the 

wf °; * % t £ W l i " ma PP cd f nto r the 30 numbcr of instructions currently being executed. Fine- 

PEs of the CM by the sl.cewise compiler The first *> grain languages allow the user to subscribe to the paral- 
operand* loaded, then he next two are loaded and , clism mncrcnt in ^eir application. Fine-grain lan- 
overUpped with two additions, and finally the result is ^ c^^^^ d ^ ™. 

™T °, m T th l S p° - f an ^. c< * u ' va,cnt m grain hardware. The parallel vector machine moSd 
FIG. 10. Since the largest Pans instructions have two " __ t *w w„ /L« ^ i I f , : 

input operands and one output operand, the compiler " ZZ ^ « * ^JS^w* do f the hard * 
must miroduce a temporary T to handle the program ™ * h ™ * * ^™ e * ,t 15 W f ^ ^anugeous 
which adds three numbers into a fourth. This Tmust be I 0 /" 6 8 cour *-«™ ed which by its very 
a full sized array in memory, so it therefore takes up 64 ™ ture . ,s morc P°^ crful - Consequently, the present 
words per PE, or 131,072 words when totaled across all *T the , ^B™ned array-based language, 
PEs. This increases the memory used by this simple « m t f ,c V M * 2 e * am P |c CM Fortran 1.0, onto the course- 
program by 25%, and wastes memory bandwidth on gramed vcc,or P roc 1 essor hardware, 
storing and loading data that the user never asked to u . v . languagcs havc bccn ma PP*d onto vector ma ' 
have stored or loaded. chines (i.e., course grained hardware). However, these 
The equivalent temporary for the slicewise program JJf** 1 !! 4 conl ** ncd onlv sin 8 lc vector processors (e.g., 
was the register Rl, which has none of the memory « c k ? C C W* J * 05 ftnd Crav ')* Moreover, some ma- 
wasting properties. Note that even if a register like Rl ? n,nes took thc ParM tVDC of approach in mapping array 
had to be spilled (and this is a rare occurrence), only 4 languages onto the hardware. However, if the designer 
values (the vector length) would have to be spilled per wants t0 gct performance, a different model of 
PE. or 8,192 totaled across of all PEs. This is obviously ,hc m achtne is required. 

preferable to spilling the entire user array. 30 As stated above, the CM-2 is a massively parallel, 

in D ,. , xy ♦ w u- w ^ , superscalar, superpipelined machine. Most computer 

III. The Parallel Vector Machine Model architectures typically do not have ail three of these 

As is apparent from the above discussion, the funda- characteristics at once. However, since the parallel 
mental idea behind the parallel vector machine model is vector machine model is so general, the teachings of the 
to have a machine that has a collection of vector proces- 35 present invention can be applied to a variety of different 
sors each with its own interface to memory. Vector machines. The only requirement is that the target hard- 
machines provide high-level operations that work on war * have access to vector processors, or pipeline pro- 
vectors— linear arrays of numbers. A typical vector cessors that can be configure to operate like vector 
operation might add two 64-entry, floating point vec- processors. Currently on the market are a variety of 
tors to obtain a single 64-entry vector result. The vector 60 different hardware models that can operate with array - 
instruction is equivalent to an entire loop, with each based languages, regardless of whether the target hard- 
iteration computing one of the 64 elements of the result, ware was originally designed to operate with such a 
updating the indices, and branching back to the begin- language. The parallel vector machine model allows 
ning. these same hardware configurations to operate more 

Vector operations have several important properties. 65 efficiently with the array-based language, then without 

The computation of each result is independent of the the model. 

computation of previous results. A single vector in- Referring to FIG. 13, a general illustration of the 

struction specifies a great deal of work— it is equivalent principle idea behind the present invention is shown. A 
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plurality of processors 13Kb, 13206, 1320c are shown. can write in easily, while still making efficient use of a 
There can be any number of processors (e.g., the CM-2 variety of different types of hardware, 
has 2K FPUs). Each processor 1320 has a plurality of There is nothing inherent about the parallel vector 
functional units. Only five functional units 1310c- 1310* machine model that inhibits it from distributing the data 
are shown for simplicity. However, the present inven- 5 at compile time. All that is required is to have a media- 
tion contemplates using any number of functional units. nism for telling the compiler how many processors are 
Each functional unit 1310 is pipelined, and as such, each in the system. For the most part, it can be done at corn- 
functional units 1310 takes a different number of cycles pile time. However, if the array size is unknown at 
to execute. For example, the add takes less cycles then compile time, then the data must be distributed at run- 
a load store operation. Heretofore, compiler designers 10 t " ne 

tried to manipulate the pipelines in the functional units Th* parallel vector machine model is a superior 

to try and compensate for the difference in cycle time model than the Paris model, because it allows the pipe- 

for different instructions. Efficient utilization of the linc aspects of the processor 1320 to be exploited. A 

functional units pipelines results in faster processor pipelined processor allows the latency for one operation 

speeds. The object is to keep every "box" in the pipeline 13 to hidden, so long as the processor is working with 

busy. multiple operations. If the pipeline is to deep there are 

The parallel vector machine model solves this prob- drawbacks. For example, in the CM-2, the pipeline is a 

lem by treating each functional unit as if it had a length function of the number of registers. Thus, as the pipeline 

equal to the longest pipeline. For example, in FIG. A « cts <*«peT, the user has access to less vector registers, 

the multiply (•) functional unit a pipeline that is seven 20 Consequently, the preferred embodiment of the present 

units deep. Thus, all five functional units will be treated invention uses a pipeline of size four, 

as if they were seven units deep. The effect of this is that *** P™ Icl vector machine model is a substantially 

the machine will lose some of the efficiency due to the opener for array languages as opposed to the virtual 

lengthening of the pipeline down the pipeline, but it „ P r ? c ? >or modd. becwae in the virtual processor model 

gains because of the efficient use of the functional units. 25 " »««P««We » ™hze the register files inside the 

Typically, the goal of designing a processing mecha- fl J^J! omt "S^-!""*. ? nly ° nC rc *£ Cr fl L c was 

nism is to try and design an architecture that will allow ^ T^^" ^ W ' 

the individual processors to nan at the highest possible ^ J^"^ 1 " 'j™ 1 * ^y the memory to mem- 

speeds. One common problem is memo£ bottleneck- M iSn^K • ? P v ^ 

^^Transferring data between the processors and 30 ET^^/^ 

memory takes time. THe more memory transactions that S^^*^ ^ 10 

*l i -I.. ^ w -» go beyond the memory bottleneck, the register files (or 

are necessary the slower the machine will be. The CM-2 J ^ mu$t ^ » 

can only transfer four bytes of data from the Pressors ^ tne t invcntion is doin thm h 

to memory per cycle. The only way tobreak this bottle- 35 |oit$ the parallc|ism across tnc vcclor p™^,* ftnd 
? W J^f I t?f ?* Tlie present inven- utUi2cs thc Uter f|lc$ ^ ^ processors, 

tion decided to break the bottleneck problem with regit- ^ paralld vcctor machjnc modcl al , ows * t 
ters^The present invention could be implemented wi h to hide the mcmory , at nide tfte , at of ^ 
caches, however, caches can not be controlled as easily dc , jvcry of in$lructions , go lnc ^te- 

as registers. , , . 40 neck, and utilize the multiple functional units in chain- 

Although the use of registers aided in solving the < wg thc opcrations ^ m instruction. By mapping 
problem of memory bottlenecking other problems bleaks of the arrays onto the course grain hardware; the 
arose. To begin with memory latency was a consider- memory hierarchy is removed. 

able problem. The target machine went from a model ^ teaching can be applied to a variety of different 
that did not have any resisters to a model that did have 45 hardware configurations. It is by no means limited to 
registers. As a consequence, a number of actions had to the hardware of the CM-2. For example, the Intel i860 
be taken. To begin with, the present invention strip microprocessors, which is essentially a scalar processor, 
mines the do loops. Second, a new low-level instruction m uli | irc tne teachings of the present invention A 
set that include vector instruction had to be developed. compiler can be built to generate the PEAC assembly 
In essence, the present invention is a virtual model of » language or an equivalent low level assembly language 
a machine. The present invention maps fine-grain array w j t h vector instructions for the i860 microprocessor 
operations onto a course-grain hardware. The proces- Thus, treating the scalar processor as a vector proces- 
sors are complicated processor with multiple functional sor. 
units and multiple pipelines. As stated above the idea 

behind the parallel vector machine model is to go be- 55 I v - Compiler Structure 

yond the memory bottleneck of an individual processor, FIO. 11 illustrates the structure of the slicewise corn- 
while at the same time, distribute the data so that the piler 1100 up through the Code Generator 1150. The 
machine can utilize all of thc individual processors si- compiler can accept multiple input languages; this dis- 
multaneously. cussion focuses on CM Fortran (CMF). The compiler 

The run-time system deals with distributing the data 60 can output three types of code; this document focuses 
from the source program across the different vector on the slicewise output (i.e., PEAC assembly language), 
processors. The data can be distributed at compile time. FIG. U shows the structure of the slicewise code gen- 
A small block of data is stored in every vector proces- erator 1150 through the generation of an executable 
sor. The compiler views the hardware associated with Unix file. 

the target machine as a collection of vector processors. 65 The Fortran Front End (FFE) 1110 of the CMF is 
Thus, each vector processor has a portion of an array built with a compiler building tool called FEAST that 
mapped onto it. The present invention allows the devel- work much like yacc and lex (available from UNIX). It 
opment of a programming language that programmers parses the input program 1105 into an abstract syntax 
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tree, performs semantic analysis to annotate the tree loops must perform an accurate dependency analysis 

whh type information and build up a symbol table, and rather than an overly conservative analysis, or it will 

also performs error check. The output is a tree structure not find anything to parallelize, 

called Common Intermediate Representation (CIR). Its However, the CMF compiler 1100 does not yet paral- 

structure reflects the syntax of the user's code. $ lelize scalar loops. Rather, it only utilizes the parallelism 

A build phase of 1130 of compiler 1100 performs a that is explicit in Fortran 90 notation. An overly conser- 

bottom up walk of the CIR syntax tree, and outputs a vative analysis does not inhibit parallelization. it merely 

graph structure (rather than a tree) called LIR flowered means that an additional subgrid loop will be required, 

Intermediate Representation). The LIR reflects the which increases overhead and decreases the range over 

semantics of the user code, rather than the form. For 10 which values can remain in DP registers. Thus, even if 

example, although the CIR might reflect simply an the compiler's analyzer always returned TRUE when 

assignment statement that adds two array sections to- asked if a dependency was present, the output code 

gether, the LIR might involve communication to align would not become serialized; at worst it would degrade 

one section with the other, a numeric type conversion to being much like Paris, with a separate subgrid loop 

from integer to real, addition, and perhaps communica- 13 wrapped around every operation, 

tion to send the result to the target array section. The Middle End's 1140 dependency analysis makes use of 

LIR produced by Build 1120 for parallel computation two simple rules: (I) any communication is a possible 

consists of nodes that call for computation of whole loop-carried dependency within the subgrid loop. Thus, 

"5^"?. r?! toriration ,*? performed *V Ph*«- a shift is conservatively treated as if it always resulted in 

Build 1120 inserts explicit nodes to perform commu- 20 a dependency, but adding two arrays together element 

nications where it needs to align arrays and/or array wise is not a dependency; and (2) any scalar is a depen- 

sections that have different array layouts. It chooses dency. 

among a large number of communication functions to The vectorized subgrid loops generated by Middle 

do this, attempting to pick the least expensive combine- End 1140 (andi ther cfore, by compiler 1100) cannot 

tion. Build 1120 also has the freedom to choose where 25 dependencies. The end result is that the loops 

(in which processors) to perform a parallel computa- elemental cod e that operate on aligned arrays 

tion; normally it will compule in a layout that is idem.- within a single array layout. Scalar code is handled by 

« °M n i C i™ T . W ? r£.- Urget de ? l 7 ,,on ««"Ply Berating the appropriate front end code. Com. 

Bmld 1120 contains a Local Optimizer which per- munication code is h^led b cal|i the ron time 

forms a number of classic scalar optimizations within 30 library function that will cause the CM-2 to perform the 

basic blocks, including common subexpression el.mina- corresponding communication function. 

tion, copy propagation (of constants, variables and Mow in TABLE , is , , e 

expressions), constant foldmg. useless assignment ehmi- which forms ^ eleinenta] a „d transformaiLat 

nation and a number of algebraic identities and strength Fortrwl ^ operations on both one and two dimensional 

reduction trmsformations " arrays. The source program will be used to demomtrate 

The Gtobal Opnmizer 1130 is an optional phase how MjtWk ^ ^ pf hj 

which performs standard comp. er optinuzauons ,ke e wi]| ^ P med . pseud0 . LIR tha 

copy propagation, strength ."duct.on.and dead/useless clo J tQ Foftran ^ (han L f R ^ * 1 « 

code elimination just like the Local Optimizer in Build ^ .*am 

■ • . 4 « -i • , , _ more readable. 
1120, only it operates on the compilation unit as a whole 40 

and can move code between basic blocks (see Chow, F. TABLE 1 

C, A portable machine-independent global optimizer- integer, imydooo) :: *,c 

design and measurements, Technical Report No. integer, ■royucuoo) - z 

83-254, Sunford University, Computer Systems Labo- ^j 1 - 1 + ah * % < 8#c - *"»-i.»Wn-i> 

ratory ( Stanford Calif. 94305-2192 (1983)). It also per- 45 ™ 0 m 

forms code motion in order to move computation, z - o 

loads, and stores out of loops, partial redundancy elimi* — — — — ^— _ _ 

nation on multi way branches, hoisting of duplicated -n,.^ , - .. . 

code out of condiiionals, and so on The output of J^^T^r ^ t ? * 

Global Optimizer 1130, like its input, is an LIR. 50 \ i TT. £ 

The purpose of Middle End 1140 is to map the LIR, * f 'f PPc ™« u £ ^ ^V™ *** * ^ 

which crates on arrays as monolithic, indivisible ^^^W ^ ^ 

units, intfthe more detailed LIR' (LIR prime). LIR' < } ™ tn »* 

induces explicit code to loop over the elements of an ^L^f* ^ "V* ?T 

array's subgrid. The LIR' alio marks which code will 55 SJSiS,hi«^ 1ST" ^ *i * 

be running on the front end processor, and which code ^^u^^Z dOCS n0t X ° ** any $CB,ar 

will run on the scquencer/PEs of the CM-2. This simpli- 5 cxam P ,e ' : 

fies the jog of the slicewise Code Generator 1150, TABLE 2 

which must produce SPARC or VAX assembler for the integer. •my(tOOO) - •,c.iempi.iemp2 

front end and PEAC assembler for the sequencer/PEs. 60 integer, imy<20.20O) :: i 

Middle End 1140 makes use of a simplified, conserva- " • , u . . 

tive form of dependency analysis. In a serial Fortran 3? J ~ " 

loop, a dependency occurs when a value produced mult « result * 2 

during one iteration of a loop is needed by another ■ - ° 

iteration, or when one iteration overwrites a location 63 ' c 0 
whose old value is needed by some other iteration. 

Dependencies inhibit parallelization. Therefore, a com- Note that two new array temporaries have been in- 

piler thai relies upon automatic paral Iclization of scalar troduced. tempi is used to separate out the elemental 
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computation that was buried within the csdhift commu- 
nication, while temp2 is used to separate the computa- 
tion that was performed on the result of the communi- 
cation. 
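The role of the two temporaries can be illustrated with a short sketch. It is written in Python rather than CM Fortran, and the helper names `cshift`, `original` and `lowered` are hypothetical, introduced only for this illustration. It assumes that the one-dimensional circular shift follows the Fortran 90 CSHIFT convention (element i of the result is element i + shift of the argument, wrapping around), and it checks that separating the elemental work from the communication leaves the result unchanged:

```python
def cshift(xs, shift):
    """1-D circular shift in the style of Fortran 90 CSHIFT:
    result[i] is xs[i + shift], wrapping around the end."""
    n = len(xs)
    return [xs[(i + shift) % n] for i in range(n)]


def original(a, c):
    """result = 1 + cshift(a * c, shift=1); result = result * 2"""
    prod = [x * y for x, y in zip(a, c)]
    return [(1 + v) * 2 for v in cshift(prod, 1)]


def lowered(a, c):
    """Same computation with the temporaries made explicit."""
    temp1 = [x * y for x, y in zip(a, c)]  # elemental work, no communication
    temp2 = cshift(temp1, 1)               # communication only
    result = [1 + v for v in temp2]        # elemental work on the shifted data
    result = [v * 2 for v in result]       # elemental
    return result


a = [1, 2, 3, 4]
c = [5, 6, 7, 8]
assert original(a, c) == lowered(a, c)
```

After the split, each statement is either purely elemental or purely communication, which is the property the dependency rules of Middle End 1140 rely on.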

Next, in TABLE 3, Middle End 1140 begins to separate the code that will run on the front end from the code that will run on the sequencer. A new type of LIR node, the PECODE, signifies the point of separation between the two. The computation of array layout is shown explicitly as well in this table. (It was actually generated in Build phase 1120.) Although the PE code has been indented to the right, there is still only a single LIR graph data structure (note the matching of the closing PECODE parentheses).

TABLE 3 



20 

TABLE 5-continued 



POP_loopsire(1injit); 
subgrid— Joop{i« 1, limit) 
{ - 0: > 



10 



CODE FOR FRONT END 


CODE FOR PEs 


shape) - tmy-Uyout([IOOO]>: 




sh*pc2 - mt»> ;;.yout([2(U0O]>; 




. . .allocate space for 




parallel variables, then . . . 




PECODE_l<ahapel, 


temp) = a • c) 


temp2 -> cthifMaiempl, dim«= t .shift** 1 




PECODE_2(ihapcl. 






mull — 1 -+ tcmp2 
result — result * 2 






• - 0 


PECODE_3<shapc2. 


z - 0) 



In effect, the basic work of Middle End 1140 illus- 
trated above is to "detect" the parallel loops of Fortran 
90 and to perform loop fusion and strip mining. A num- 
ber of other optimizations are also performed. 

One important optimization is the application of copy 
propagation within PECODEs. Both the Build phase 
1120 and the Global Optimizer 1130 performed coy 
propagation, but only on scalar computations. The rea- 
15 son is that in general, full dependency analysis is re- 
quired in order to make sure that copy propagation is a 
legal transformation. However, within PECODEs, 
there are guaranteed to be no loop-carried dependen- 
cies, so copy propagation within a single PECODE is 
always legal. In the example of Table 5, Middle End 1140
would have performed copy propagation on the body of
PECODE_2 as shown in Table 6, eliminating a store of result
to memory.

TABLE 6

subgrid_loop(i = 1, limit)
{ result = 1 + temp2;
  result = result * 2; }
= = > via copy propagation = = >
subgrid_loop(i = 1, limit)
{ result = (1 + temp2) * 2; }

The final step in Middle End 1140 processing is to insert
explicit code by which the sequencer communicates to the PEs
the information needed to execute a PECODE block. Basically,
this consists of the addresses of the parallel variables, the
values of any scalar variables, the number of elements in the
subgrid (which is a part of the array layout data structure at
runtime), and a parallel stack pointer for register spills.
This information is passed as part of a PE_FUNCALL node. This
function invocation node names the PECODE that should be
executed, and pushes the appropriate arguments into the FIFO
that connects the front end to the sequencer. The output of
Middle End 1140 at this stage is LIR', a single data
structure, but for clarity it is split into two separate
tables. Table 4 contains the front end code, and Table 5
contains the PE code.

TABLE 4

shape1 = array_layout([1000]);
shape2 = array_layout([20,200]);
. . . alloc space, then . . .
PE_FUNCALL(&PECODE_1,
           temp1, a, c, PE_SP,
           shape1->subgrid_size);
CMRT_CSHIFT(temp2, temp1, 1, 1);
PE_FUNCALL(&PECODE_2,
           result, temp2, a, PE_SP,
           shape1->subgrid_size);
PE_FUNCALL(&PECODE_3,
           x, PE_SP,
           shape2->subgrid_size);

TABLE 5

PECODE_1: POP_ARGS(temp1, a, c, PE_SP);
          POP_loopsize(limit);
          subgrid_loop(i = 1, limit)
          { temp1[i] = a[i] * c[i]; }
PECODE_2: POP_ARGS(result, temp2, a, PE_SP);
          POP_loopsize(limit);
          subgrid_loop(i = 1, limit)
          { result = 1 + temp2;
            result = result * 2; }
PECODE_3: POP_ARGS(x, PE_SP);

Another optimization that Middle End 1140 performs in order
to increase the size of the subgrid loops is code motion:
scalar code that sits between two blocks of elemental code
will be moved out of the way, if the compiler can prove that
the code motion will not violate any dependencies. This
allows the two elemental blocks on either side of the scalar
code to merge into one large block. The larger subgrid loop
that results gives the later PE phases, such as the PE
Scheduler (shown in FIG. 12), more opportunity for
optimization. In keeping with the flavor of minimal
dependency analysis, the compiler only moves scalar code that
it inserted; user code is not moved. Examples of the kind of
scalar code the compiler can move include dope vector
manipulation, allocation of space for parallel arrays, and
allocation of array layout geometries.
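
The code motion described above can be illustrated with a small Python sketch. It is not taken from the CMF compiler: the `Block` record, the read/write sets, and the function `merge_elemental_blocks` are names assumed for this example only. The sketch moves a compiler-inserted scalar block ahead of the elemental block that precedes it when the two are independent, and then merges the two elemental blocks that become adjacent.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    kind: str                         # "elemental" or "scalar"
    stmts: list
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    compiler_inserted: bool = False   # user scalar code is never moved

def independent(a, b):
    """True if swapping the order of a and b violates no dependency."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)

def merge_elemental_blocks(blocks):
    out = []
    for blk in blocks:
        if (blk.kind == "elemental" and len(out) >= 2
                and out[-1].kind == "scalar" and out[-1].compiler_inserted
                and out[-2].kind == "elemental"
                and independent(out[-2], out[-1])):
            scalar = out.pop()        # hoist the scalar code out of the way
            prev = out.pop()
            merged = Block("elemental", prev.stmts + blk.stmts,
                           prev.reads | blk.reads, prev.writes | blk.writes)
            out.extend([scalar, merged])
        else:
            out.append(blk)
    return out
```

In this sketch a scalar block that came from user code is left where it is, so the elemental blocks on either side of it stay separate, which mirrors the restriction stated in the text.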

Middle End 1140 performs an unmasking transformation which
converts the Fortran WHERE statement into an explicit vector
merge, using vector versions of bitwise XOR, AND, and NOT to
implement the masked array assignments in the body of the
WHERE. This transformation allows the RHS of assignment
statements in the body to be evaluated without a mask; it is
only the assignment to a target that is masked. This is
desirable because unconditional execution is faster on the
CM-2 (and on many vector architectures) than conditional
execution. It is legal for CMF to evaluate the RHS
expressions in an unmasked manner because of the helpful
semantics of Fortran 90: Fortran 90 does not contain a way
for the user to express side effects on the RHS. If an array
valued function call is on the RHS, it is defined by the
standard to execute in an unmasked manner.
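
As an illustration of the unmasking transformation, the following Python sketch treats a statement such as WHERE (mask) target = rhs as an unmasked evaluation of rhs followed by a bitwise merge. The function name `where_assign`, the 32-bit word width, and the exact combination of operations are assumptions made for this example; the text does not give the instruction sequence that CMF actually emits.

```python
MASK32 = 0xFFFFFFFF

def where_assign(target, rhs, mask):
    """Masked assignment expressed as an unmasked vector merge.

    target, rhs: lists of 32-bit words (rhs was computed for every element).
    mask: list of booleans, one per element.
    """
    out = []
    for old, new, m in zip(target, rhs, mask):
        bits = MASK32 if m else 0          # all-ones or all-zeros word
        keep = old & (~bits & MASK32)      # old value AND (NOT mask)
        take = new & bits                  # new value AND mask
        out.append(keep ^ take)            # the two parts share no bits, so XOR merges them
    return out
```

For example, `where_assign([1, 2, 3, 4], [10, 20, 30, 40], [True, False, True, False])` keeps the second and fourth elements of the target and replaces the others.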

Finally, Middle End 1140 performs an optimization known as
temporary compression, in which compatible array temporaries
whose lifetimes do not overlap are merged into one. This
saves memory, and also reduces the number of calls to
allocate/deallocate routines.
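
Temporary compression can be sketched as a greedy assignment of temporaries to shared storage slots. In the Python fragment below, "compatible" is taken to mean "same shape", and lifetimes are given as first-use and last-use positions; both are simplifying assumptions for this example, as is the function name `compress_temporaries`.

```python
def compress_temporaries(temps):
    """temps: list of (name, shape, first_use, last_use).

    Returns a dict mapping each temporary to the name of the storage
    slot it shares. Two temporaries share a slot only if their shapes
    match and their lifetimes do not overlap.
    """
    slots = []        # each entry: [slot_name, shape, last_use_of_occupant]
    assignment = {}
    for name, shape, first, last in sorted(temps, key=lambda t: t[2]):
        for slot in slots:
            if slot[1] == shape and slot[2] < first:
                slot[2] = last               # reuse the slot, extend its lifetime
                assignment[name] = slot[0]
                break
        else:
            slots.append([name, shape, last])  # no reusable slot: allocate a new one
            assignment[name] = name
    return assignment
```

Each reused slot stands for one allocate/deallocate pair that no longer has to be executed at run time.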

Code Generator 1150 of the CMF compiler 1100 is based on
PQCC technology (Leverett et al., Computer 13(8):38-49
(1980)). It performs a constrained pattern matching walk of
the LIR' graph. When a pattern and its constraints match, an
action that outputs appropriate code is fired. For example,
if x and y are scalar floats that are in registers, a pattern
will match as follows:

PLUS(x, y)
= = > CGN outputs to scalar stream = = >
FADD Rx, Ry, Routput

The fact that the output is available in Routput is added to
the context of the PLUS LIR' node, thereby affecting the
continued pattern matching of nodes.
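
The pattern-and-action idea can be sketched in a few lines of Python. The tuple encoding of LIR' nodes, the register naming, and the function `gen` are assumptions made for this example and do not reflect the actual PQCC tables; only the PLUS pattern from the text is covered.

```python
import itertools

def gen(node, regs, out, fresh):
    """Bottom-up walk; returns the register that holds the node's value.

    node: a variable name, or a tuple (op, lhs, rhs).
    regs: variable name -> register (the constraint "is in a register").
    out:  list that receives emitted scalar-stream instructions.
    fresh: iterator yielding numbers for new output registers.
    """
    if isinstance(node, str):
        return regs[node]                  # KeyError if the constraint is not met
    op, lhs, rhs = node
    rx = gen(lhs, regs, out, fresh)
    ry = gen(rhs, regs, out, fresh)
    if op == "PLUS":                       # pattern PLUS(x, y), both operands in registers
        rout = f"R{next(fresh)}"
        out.append(f"FADD {rx}, {ry}, {rout}")
        return rout                        # context used when matching the parent node
    raise NotImplementedError(op)
```

Calling `gen(("PLUS", "x", "y"), {"x": "Rx", "y": "Ry"}, out, itertools.count(1))` appends a single FADD to `out`; the returned register plays the role of Routput in the text.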

The CGN produces two entirely separate code streams. All
code other than that inside PECODEs is sent to a scalar code
stream. The code sent to the scalar code stream is either
SPARC or VAX assembler, except that the code assumes an
infinite number of registers are available. A later Register
Allocator phase will map these virtual registers onto the
physical registers. Similarly, all code that is inside a
PECODE is sent to a PE code stream in the form of virtual
register PEAC.

The purpose of the four post code generator scalar FE phases
1210-1240 shown in FIG. 12 is to take the virtual register
assembly code that is intended for execution on the front end
and actually generate files containing object code for the
front end. These FE phases 1210-1240 are fairly conventional.

The four post code generator parallel PE phases 1250-1280
shown in FIG. 12 take code intended for execution by the PEs
(written in PEAC assembly code) and produce object files that
can be downloaded into the CM-2 sequencer. The phases will be
explained using the example Fortran and virtual register PEAC
code in Table 7.

TABLE 7

real, array(10) :: a, b, c, d, e, f, g, h, i, j
a = b + c + d + e + f + g + h + i
j = b * c * d * e * f * g * h * i
= = > via CMF compiler, produce virtual register PEAC = = >

 1  flodv  [vP1:1+0]1  vV11:1
 2  flodv  [vP2:1+0]1  vV12:1
 3  faddv  vV11:1 vV12:1 vV13:1
 4  flodv  [vP3:1+0]1  vV14:1
 5  faddv  vV13:1 vV14:1 vV15:1
    . . . etc., alternating loads and adds . . .
16  fstrv  vV25:1 [vP9:1+0]1
17  fmulv  vV11:1 vV12:1 vV26:1
18  fmulv  vV26:1 vV14:1 vV27:1
19  fmulv  vV27:1 vV16:1 vV28:1
20  fmulv  vV28:1 vV18:1 vV29:1
21  fmulv  vV29:1 vV20:1 vV30:1
22  fmulv  vV30:1 vV22:1 vV31:1
23  fmulv  vV31:1 vV24:1 vV32:1
24  fstrv  vV32:1 [vP10:1+0]1

The Fortran program in TABLE 7 elementally adds a number of
arrays together, and also multiplies the same arrays
together. The PEAC code begins with two loads in lines 1 and
2, which load virtual vector registers vV11 and vV12 with the
values pointed to by two virtual sequencer pointer registers
vP1 and vP2. The offset added to the pointer is zero, and the
stride between the pointed-to values is one. Earlier POPs
that are not shown had set those pointer registers to point
to the beginning of the subgrid for b and c, the first two
operands of the Fortran statements. Thus, a vector of 4 b
values and a vector of 4 c values are loaded.

The third PEAC statement adds those values together. There
follows a sequence of statements that alternately load
another variable (d, e, f, etc.) and add it into the sum. In
line 16, a vector of four sum values is written out. Lines
17-23 perform the multiplications specified by the user
program. No loads are needed because all of the data has been
loaded into registers already. The final PEAC statement
stores a vector of four multiplied values.

The PEAC code is executed repeatedly as the core of a subgrid
loop. On each execution it processes four of the elements of
the subgrid, so if the subgrid is 16 elements long it will
execute four times.

The PE Register Allocator 1250 phase accepts virtual register
PEAC as its input and replaces the virtual registers with
actual DP registers and sequencer pointer registers. Spill
and unspill code is inserted as needed. Finally, the last use
of each pointer register in the loop is converted to use
autoincrement mode, so that the pointer is ready for use in
the next iteration of the subgrid loop. TABLE 8 shows the
output of PE Register Allocator 1250 for the same program.

TABLE 8

  flodv  [aP2+0]1++  aV0
  flodv  [aP3+0]1++  aV1
  faddv  aV0 aV1 aV2
  flodv  [aP4+0]1++  aV3
  faddv  aV2 aV3 aV4
  fstrv  aV4 [aP10+0]1++
  fmulv  aV0 aV1 aV4
  fmulv  aV4 aV3 aV0
  fmulv  aV0 aV5 aV1
  fmulv  aV1 aV2 aV3
R flodv  [aP7+0]1++  aV4
  fmulv  aV3 aV4 aV0
R flodv  [aP8+0]1++  aV5
  fmulv  aV0 aV5 aV1
  fmulv  aV1 aV6 aV2
  fstrv  aV2 [aP11+0]1++

PE Register Allocator 1250 uses a combination of "on-the-fly"
register allocation with some intelligence about what values
are needed at what point. This additional intelligence is of
two forms: spill code is not generated for registers that
contain values that are known to already be in memory, and
exact next use information is calculated prior to register
assignment. The reason that an on-the-fly algorithm produces
acceptable code is that the structure of the PECODE is so
regular and predictable: it consists of exactly one basic
block, and the last statement is in effect a goto to the
beginning (the PECODE is in the body of a subgrid loop). A
more complex graph-coloring algorithm would probably not
produce better output code because this simple algorithm has
been so tuned to its relatively simple task.

Notice that the output now refers to the physical registers
in the architectural diagram of FIG. 3. For example, the
virtual vector register vV32 has been mapped into the actual
vector register aV2, and the virtual pointer register vP2 has
been mapped into the actual pointer register aP3. It turns
out that there were not enough vector registers to hold all
of the user variables. As a result, two unspills were
inserted, on lines marked with an "R". These unspills reload
vectors into registers that had been temporarily preempted
for some other purpose. Notice that no vector spills were
inserted. PE Register Allocator 1250 almost always chooses to
preempt values that are already available in memory, so that
it can unspill (or reload) the value without having to waste
memory bandwidth on saving it.
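
The allocation policy described above can be approximated by the following Python sketch. It is an illustration under stated assumptions, not the CMF implementation: the instruction tuples, the function `allocate`, the `spill_` address naming, and the choice of the farthest next use as the tie-breaker among candidates are all introduced here for the example. The sketch computes exact next-use positions first, frees registers whose values are dead, and when it must preempt a register it prefers a value that already has a copy in memory so that no spill store is needed.

```python
INF = float("inf")

def allocate(code, num_regs):
    """On-the-fly register allocation for one basic block (illustrative).

    code: list of (op, srcs, dst, addr) tuples over virtual registers.
      "flodv" loads addr into dst, "fstrv" stores srcs[0] to addr,
      any other op computes dst from srcs.
    Returns text lines that use physical registers aV0..aV(num_regs-1);
    unspills (reloads) are prefixed with "R".
    """
    uses = {}                                   # exact next-use information
    for pos, (_, srcs, _, _) in enumerate(code):
        for v in srcs:
            uses.setdefault(v, []).append(pos)

    def next_use(v, pos):
        return next((p for p in uses.get(v, ()) if p >= pos), INF)

    where, home, out = {}, {}, []               # register map, memory copies, output
    free = list(range(num_regs))

    def get_reg(pos, pinned):
        if free:
            return free.pop(0)
        cands = [v for v in where if v not in pinned]
        pool = [v for v in cands if v in home] or cands   # prefer values already in memory
        victim = max(pool, key=lambda v: next_use(v, pos))
        if victim not in home:                  # no memory copy: a spill store is required
            home[victim] = f"spill_{victim}"
            out.append(f"  fstrv aV{where[victim]} [{home[victim]}]")
        return where.pop(victim)

    for pos, (op, srcs, dst, addr) in enumerate(code):
        for v in srcs:                          # unspill operands that were preempted
            if v not in where:
                where[v] = get_reg(pos, set(srcs))
                out.append(f"R flodv [{home[v]}] aV{where[v]}")
        src_regs = [where[v] for v in srcs]
        for v in list(where):                   # release values with no further use
            if next_use(v, pos + 1) == INF:
                free.append(where.pop(v))
        if op == "fstrv":
            home[srcs[0]] = addr                # the value now also lives in memory
            out.append(f"  fstrv aV{src_regs[0]} [{addr}]")
            continue
        where[dst] = get_reg(pos, set())
        if op == "flodv":
            home[dst] = addr
            out.append(f"  flodv [{addr}] aV{where[dst]}")
        else:
            regs = " ".join(f"aV{r}" for r in src_regs)
            out.append(f"  {op} {regs} aV{where[dst]}")
    return out
```

Run on a program shaped like the one in TABLE 7 with seven vector registers, this sketch inserts reloads but no spill stores, because every value it preempts was loaded from memory and never modified. The physical register numbers it picks will generally differ from those in TABLE 8.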

PE Scheduler 1260 is an optional phase that takes the output
of PE Register Allocator 1250 and transforms the code to
overlap loads and stores. This generally results in a 30 to
40 percent increase in performance.

A classic compiler problem is determining which of the
Scheduler 1260 and Register Allocator 1250 phases should run
first. Each of the two possible orderings has its benefits.
If Register Allocator 1250 is run first, the spill and
unspill code that it produces can be scheduled, which is
important for performance when registers are in short supply.
The slicewise architecture has 7 vector registers, which is a
very limited number and leads to the use of many unspills for
large expressions. However, in mapping independent virtual
registers onto a smaller set of physical registers, Register
Allocator 1250 introduces false dependencies which might
prevent Scheduler 1260 from moving code and producing optimal
output code. If Scheduler 1260 is run first, it can produce
optimal code, but then the spills/unspills will not be
overlapped.

The ideal solution is to combine both phases into a single
phase, but this is not practical for this part of the
compiler from a software engineering point of view. The
combined phase would be too large and too complicated. Given
the performance benefits of scheduling and overlapping the
spill/unspill code, the preferred embodiment of the present
invention runs Register Allocator 1250 first. However, in
addition to the register-allocated PEAC, Scheduler 1260
receives as input retained information about the original
virtual registers. This allows it to compute the true
dependency information, and to undo false dependencies
introduced by Register Allocator 1250, if necessary.

Table 9 shows the output of PE Scheduler 1260 for the sample
program. Every load and store other than the first has been
overlapped. In a sense, the register unspills that were
inserted by PE Register Allocator 1250 turn out to be free,
because they do not take away any resources that could be
used for some other purpose. (No other loads could be
overlapped even if the unspills were eliminated.)

TABLE 9

The scheduling algorithm is a heuristic rule-based system.
Subject to data dependencies, it is free to move instructions
anywhere within the PECODE; it does not have to maintain
ordering between code that came from different user
statements as long as it can prove there is no dependence
(input or output). This is relatively simple to do, since
each PECODE consists of a single basic block which is looped
over, and there are never any loop carried dependencies.
Scheduler 1260 uses the virtual-to-physical register map
information to prove the independence of loads and stores
that use the same physical register but in fact arise from
different virtual registers.

PE Peephole optimizer 1270 is an optimizer for locally
improving the target code. Peephole optimizers generally
improve performance of the target program by examining short
sequences of target instructions (called the peephole) and
replacing these instructions by a shorter or faster sequence,
when possible. See Aho, et al., cited above, pgs. 554-557.

PEAC Assembler 1280 is a table driven translator that takes
PEAC and turns it into calls to CMIS (Connection Machine
Instruction Set) microcode (Thinking Machines Corporation,
CMIS Reference Manual, Cambridge, Mass. 02142 (1990)). It
hides the cumbersome details of CMIS, such as the way that
the setting of the DP instruction opcode first takes effect
several cycles after it is changed by a CMIS program, so the
control of the DP opcode pins must be carefully interleaved
with other code.

CM Run Time (CMRT) library 1290 provides the communication
functionality that previously was available through Paris,
but it operates on slicewise data. Object modules 1285
produced by CM Fortran compiler 1100 are linked with CMRT
library 1290 in order to produce an executable file 1297.
CMRT 1290 entry point functions are all on the front end, but
the implementation of CMRT 1290 includes functions that run
on the sequencer as well.

The first class of CMRT 1290 functions contains those that
handle array geometries and thereby allow the mapping of
arbitrary array shapes onto the CM-2. The second class
includes communication between the front end and the PEs,
including the calling of PECODE functions, scalar broadcast,
parallel reductions to scalars, and the serial reading and
writing of individual array elements. The third and final
class includes the interprocessor communication functions,
such as grid shifts and routes, scans and spreads, and
general permutation functions.

When a CM Fortran program is executed, the front end begins
by calling a CMRT function to initialize the CM-2. Upon
entrance to a subroutine, CM-2 memory is allocated for any
arrays declared in that subroutine, and CMRT functions are
used to create array geometries that map arrays onto the PEs.
When a PECODE is called for the first time, the CMRT_FUNCALL
function loads the CMIS microcode produced by the PE
Assembler for that particular PECODE into sequencer memory
and then invokes that microcode. If the sequencer runs out of
memory, older PECODE microcode may be invalidated and
overwritten (using an LRU algorithm) to create space.

An excellent description of numerous vectorization
techniques, both conventional and new, is contained in Wolfe,
Optimizing Supercompilers for Supercomputers, The MIT Press,
Cambridge, Mass. (1989). The scheduling techniques are based
on work in Gibbons et al., Efficient Instruction Scheduling
for a Pipelined Architecture, Proceedings of the ACM SIGPLAN
1986 Symposium on Compiler Construction, SIGPLAN Notices
21.6, pgs. 11-16 (June 1986). Interactions between schedulers
and register allocators are covered in Goodman et al., Code
Scheduling and Register Allocation in Large Basic Blocks,
Proceedings of the International Conference on
Supercomputing, pgs. 442-452 (July 1988), and treatment of
groups of registers as vectors is discussed in Jouppi et al.,
A Unified Vector/Scalar Floating Point Architecture,
Architectural Support for Programming Languages and Operating
Systems, pgs. 134-143 (1989).
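
The run-time behavior described above for CMRT_FUNCALL, in which PECODE microcode is loaded into sequencer memory on first call and older microcode is overwritten on an LRU basis, can be sketched as follows. The class `MicrocodeCache`, its word-count capacity, and the `loads` counter are assumptions made for this Python illustration; the text does not describe the actual data structures.

```python
from collections import OrderedDict

class MicrocodeCache:
    """Toy model of sequencer memory holding PECODE microcode."""

    def __init__(self, capacity_words):
        self.capacity = capacity_words
        self.used = 0
        self.resident = OrderedDict()   # PECODE name -> size, least recently used first
        self.loads = 0                  # number of times microcode had to be loaded

    def funcall(self, name, size):
        if name in self.resident:
            self.resident.move_to_end(name)          # mark as most recently used
            return
        if size > self.capacity:
            raise MemoryError(f"{name} does not fit in sequencer memory")
        while self.used + size > self.capacity:      # invalidate the oldest microcode
            _, old_size = self.resident.popitem(last=False)
            self.used -= old_size
        self.resident[name] = size
        self.used += size
        self.loads += 1
```

A PECODE that is called repeatedly inside a loop therefore pays the load cost only once, unless other PECODEs push it out of sequencer memory in between.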

While the invention has been particularly shown and
described with reference to preferred embodiments thereof, it
will be understood by those skilled in the art that various
changes in form and details may be made therein without
departing from the spirit and scope of the invention.

What is claimed is: 

1. A computer implemented method of compiling a fine-grained
array based source program written for a parallel machine,
comprising the steps of:

(1) entering said source program into a front end of a
compiler, wherein said front end produces a common
intermediate representation (CIR) syntax tree;

(2) building a lowered intermediate representation (LIR) from
said CIR syntax tree by performing a bottom-up walk of said
CIR syntax tree;

(3) mapping said LIR into a more detailed LIR', said LIR'
marks which code will run on a scalar front-end processor and
which code will run on a plurality of parallel vector
processors; and

(4) generating two separate code streams from said LIR'.

2. The method of claim 1, further comprises the step of
optimizing said LIR.

3. The method of claim 1, wherein said plurality of parallel
vector processors are functionally identical.

4. The method of claim 1, wherein said parallel vector
processors form a coarse-grained hardware.

5. The method of claim 1, wherein said step (4) generates a
scalar code stream that will operate on a scalar machine and
a virtual register code stream that will operate on a
parallel vector machine.

6. The method of claim 5, wherein said second code stream
includes vector instructions which are used to control said
parallel vector machine which generally processes said vector
instructions concurrently.

7. The method of claim 5, further comprises the step of
generating first object code for a scalar processor and
second object code for said plurality of parallel vector
processors.

8. The method of claim 7, wherein in said step of generating
object code for said plurality of vector processors,
comprises the steps of:

(a) replacing the virtual registers with actual data
processor registers and sequence pointer registers;

(b) transforming the code generated in step (a) to overlap
loads and stores;

(c) optimizing the code generated in step (b); and

(d) generating microcode from said optimized code of step
(c).

9. The method of claim 7, wherein said first object code and
said second object code are linked with a run time library in
order to produce executable code, wherein said run time
library operates on slicewise data.

10. A computer implemented method of compiling a source
program for coarse-grained parallel machine, comprising the
steps of:

(1) receiving a source program written in a fine-grained
array-based programming language;

(2) translating said source program into target code which
includes vector instruction, said target code operates on a
coarse-grained hardware that processes said code concurrently
and utilizes at least one register file, wherein said
coarse-grained parallel machine contains a plurality of
vector processors, said plurality of vector processors
contains a plurality of functional units, the pipeline of
said plurality of functional units being of equal or unequal
length; and

(3) treating said at least two functional units, if said at
least two functional units have a pipeline of unequal length,
as having equal length.

11. A computer implemented method of compiling a fine-grained
array based source program written for a parallel machine,
comprising the steps of:

(a) front end means for receiving a source program, wherein
said front end means produces a common intermediate
representation (CIR) syntax tree;

(b) building means for building a lowered intermediate
representation (LIR) from said CIR syntax tree by performing
a bottom-up walk of said CIR syntax tree;

(c) mapping means for mapping said LIR into a more detailed
LIR', said LIR' marks which code will run on a scalar
front-end processor and which code will run on a plurality of
parallel vector processors; and

(d) generating means for generating two separate code streams
from said LIR'.

12. The system of claim 11, further comprising optimizer
means for optimizing said LIR.

13. The system of claim 11, wherein said plurality of
parallel vector processors are functionally identical.

14. The system of claim 11, wherein said parallel vector
processors form a coarse-grained hardware.

15. The system of claim 11, wherein said generating means
further comprises a first means for generating a scalar code
stream that will operate on a scalar machine and for
generating a virtual register code stream that will operate
on a parallel vector machine.

16. The system of claim 15, wherein said second code stream
includes vector instructions which are used to control said
parallel vector machine which generally processes said vector
instructions concurrently.

17. The system of claim 15, further comprising second means
for generating first object code for a scalar processor and
second object code for said plurality of parallel vector
processors.

18. The system of claim 17, wherein said second means further
comprises generating object code for said plurality of vector
processors, comprising:

(a) means for replacing the virtual registers with actual
data processor registers and sequence pointer registers;

(b) means for transforming the code generated in step (a) to
overlap loads and stores;

(c) means for optimizing the code generated in step (b); and

(d) means for generating microcode from said optimized code
of step (c).

19. The method of claim 17, wherein said first object code
and said second object code are linked with a run time
library in order to produce executable code, wherein said run
time library operates on slicewise data.